Abstract

Two English texts by Lewis Carroll are compared, one (Alice in Wonderland)
together with its Esperanto translation, the other (Through the Looking-Glass)
in English only, in order to observe whether natural and artificial languages
differ significantly from each other. One-dimensional time-series-like signals
are constructed using only word frequencies (FTS) or word lengths (LTS). The
data are studied through (i) a Zipf method for sorting out correlations in the
FTS and (ii) a method based on the Grassberger-Procaccia (GP) technique for
finding correlations in the LTS. Features are compared: different power laws
are observed, with characteristic exponents for the ranking properties and for
the phase space attractor dimensionality. The Zipf exponent can take values
much less than unity (ca. 0.50 or 0.30) depending on how a sentence is defined.
This non-universality is conjectured to be a measure of the author's style.
Moreover the attractor dimension r is a simple function of the so called phase
space dimension n, i.e., $r = n^{\lambda}$, with $\lambda = 0.79$. Such an
exponent is also conjectured to be a measure of the author's creativity.
However, even though there are quantitative differences between the original
English text and its Esperanto translation, the qualitative differences are
very minute, indicating in this case a translation which respects relatively
well, along the lines of our analysis, the content of the author's writing.

Key words: Econophysics, recession, prosperity, Latin America



Preprint submitted to Elsevier, 28 February 2008

1 Introduction

Human languages are systems usually composed of a large number of internal
components (the words, punctuation signs, and blanks in printed texts) and
rules (grammar). Relevant questions pertain to the lifetime, concentration,
distribution, ... complexity of these components and to the relations between
them. Thus human language is a new emerging field for the application of
methods from the physical sciences in order to achieve a deeper understanding
of linguistic complexity [1,2,3,4,5]. Language distributions, competitions,
life durations, ... have indeed become an active field of research in
statistical physics since [4,5,6,7], where usual techniques based on
non-equilibrium considerations [8] and agent based models are already much
applied.

One should distinguish two main frameworks. On one hand, language developments
seem to be understandable through competitions, like in Ising models, and in
self-organized systems. Their diffusion seems similar to percolation and
nucleation-growth problems, taking into account the existence of different
time scales for inter- and intra-effects. The other framework is somewhat older
and originates from more classical linguistics studies; it pertains to the
content and meanings [9,10]. This latter case is of interest here and is the
main subject of this report, within a statistical physics framework.

Concerning the internal structure of a text, supposedly characterized by the 
language in which it is written, it is well known that a text can be mapped 
into a signal, of course first through the alphabet characters. However it can 
be also reduced to less abundant symbols through some threshold, like a time 
series, which can be a list of +1 and -1, or sometimes 0. Thereafter one could 
apply at this stage many techniques of signal analysis. 

In fact, laws of text content and structure have been searched for over a long
time, e.g. by Zipf and others [11,12,13,14,15], through the least effort (so
called ranking) method. The technique is now currently applied in statistical
physics as a first step to obtain, when it exists, the primary scaling law. It
has been somewhat of a surprise that the number of words w(h) which occur h
times in a text is such that $w(h) \sim 1/h^{\gamma}$, where
$\gamma \simeq 2$, while the frequency f of the words, ranked with rank R
according to that frequency, behaves like another power law,
$f \sim R^{-\zeta}$, where the exponent $\zeta$ is almost always close to 1.0
[16]. Some arguments have been presented to explain this, based on constrained
correlations [17,18]. Another distribution has been studied, i.e. the
distribution of word lengths in a text. Whence two features can be looked at:
(i) word frequencies (FTS) or (ii) word lengths (LTS). We hereby consider that
in physics terms they represent different measures of the system. The first
one leads to characterizing the spanned phase space through a measure; it is a
static-like, equilibrium approach, obtained after the text is finalized. The
second rather contains a time evolution aspect: it takes more time to
pronounce (or read) a long word than a short one. Whence a technique of
analysis other than the Zipf one should be put forward. We implement the
Grassberger-Procaccia (GP) technique for finding "time correlations" in the
text through the analysis of the LTS, considered as a signal spanning some
attractor in a space of a priori unknown dimension.



Obviously there are many ways to map a text onto a time series, but in the
present study only the above two series are considered, due to the physical
meaning which can be thought to be implied in the mapping.

There is no need to recall the many communications in which a comparison of
the properties of such "time signals" has been presented; sometimes even
(very) artificial so called languages [11] have been discussed, like those
used for simulation codes on computers [19]. Comparisons of different truly
human languages arising from apparently different origins or containing
different signs have also been made; e.g., beside English, one can find
references about Greek [20,21,22], Turkish [22], Chinese [24], ... "Linguistic
time series" have often been studied at a letter or word level [25,26,27,28]
or, as in Montemurro and Pury [27,28], through a frequency mapping, similar
though not identical to the one described below. Others have considered Zipf
law(s) at the sentence level [29,30], a few sometimes strangely neglecting the
punctuation [31,32].




Esperanto is an artificially and somewhat recently constructed language [33],
which was intended to be an easy-to-learn lingua franca. Previous statistical
analyses seem to indicate that Esperanto's statistical proportions are similar
to those of other languages [31]. It was found that they resemble mostly those
of German and Spanish and, somewhat surprisingly, least those of French and
Italian. Incidentally, English seems to be an intermediate case [15]. Yet
there are quantitative differences: English contains ca. 1 M words [35],
Esperanto 150 k words [36]. Other artificial languages exist, like those of
the Magma [37] and Urban Trad [31] music groups, the latter specifically
designed for a song competition, i.e. the Eurovision contest [32]. As in e.g.
rap music lyrics or French verlan, the thesaurus is of rather limited size in
all these cases.

To my knowledge, few comparisons exist of texts translated from one language
to another [10,41,42,43], in particular into artificial languages. We present
below an original consideration in this respect: the analysis of, and results
about, a translation between one of the most commonly used languages, i.e.
English, and a relatively recent language, i.e. Esperanto.

The text to be used was chosen for its wide diffusion, being freely available
from the web [44], and as a representative one of a famous scientist, Lewis
Carroll, i.e. Alice in Wonderland (AWL) [13]. Moreover, knowing the special
(mathematical) quality of this author's mind and some, as I thought a priori,
possibly special way of writing, a benchmark has been chosen for comparison,
i.e. Through the Looking-Glass (TLG) [16], alas to my knowledge only available
in English on the web [47]. Yet this will allow us to discuss whether the
differences, if any, between Esperanto and English are apparently due to the
translation or, on the contrary, to the specificity of this author's work. It
might also be expected that one could observe whether some style or vocabulary
change has occurred between two texts having appeared at different times, 1865
and 1871. Previous work on the English AWL version should be mentioned [15],
where the discussion mainly pertains to the corpus size effect on the validity
of Zipf law, but where a relevant ingredient to be taken into account in
discussing most written texts is emphasized, i.e. a mixing of oral and
descriptive accounts.

In Sect. 2, a few elementary facts and basic statistics on these texts are
presented; the methodology is briefly exposed, i.e. one recalls (i) two simple
ways to map texts into signals, i.e., the frequency time series (FTS) and
the (word) length time series (LTS), (ii) the Zipf ranking technique, (iii)
the Grassberger-Procaccia (GP) method [48,49] used for finding correlations.
Similar techniques for comparing English and Greek texts, but not from a
translation point of view, can be found in [20]; however the published work
contains a few annoying misprints or defects, which induce us to reformulate
the techniques when applying them to the present problem. In Sect. 3, the
results are presented: (i) a Zipf analysis of the frequency time series (FTS),
(ii) a GP analysis of the (word) length time series (LTS). The results are
discussed in Sect. 4.



2 Data and Methodology

For these considerations the two texts mentioned here above and one translation
have been selected and downloaded from a freely available site, resulting
obviously in three files. The chapter heads have been removed. All analyses
are carried out over this reduced file for each text. Basic statistics, like
the number of words, the longest sentence, ... are given in Table 1 for each
text, and for the first two chapters of both AWL versions. A few facts attract
some attention:

(1) the number of dots is much smaller in AWLeng than in AWLesp and also
in TLGeng

(2) automatically the longest sentence occurs in AWLeng, with many more
characters

(3) the longest sentence in AWLesp occurs between commas

(4) the number of semicolons is very small in TLGeng

(5) the longest sentence ever occurs in TLGeng, between semicolons

(6) there are very few exclamation marks in AWLesp

(7) but a long sentence is then found between these in such a work

(8) more importantly the number of sentences is much smaller in AWLeng
than in AWLesp.

                                 AWLeng   AWLesp   TLGeng
Number of words                   27342    25592    30601
Number of different words          2958     5368     3206
Number of characters             144927   154445   164147
Number of punctuation marks        4481     4752     4828
Number of "sentences"              1633     2016     2059
Words in chap. 1                   2194     1858
Different words in chap. 1          652      853
Words in chap. 2                   2188     1916
Different words in chap. 2          665      829
Number of dots                      979     1545     1315
  Longest "sentence"               1669      825      864
Number of commas                   2419     2324     2441
  Longest "sentence"                373     1170      368
Number of semicolons                196      207       72
  Longest sentence                 6624     6043    12501
Number of colons                    234      205      256
  Longest sentence                 4586     5576     3146
Number of question marks            203      205      254
  Longest sentence                 6323     5581     5212
Number of exclamation marks         451      266      490
  Longest sentence                 4388     6249     4016

Table 1

Basic statistical data for the three texts of interest; in each case the
longest sentence is measured in terms of the number of characters (not in terms
of words), between two successive marks of the sort listed on the line above




Fig. 1. Zipf (log-log) plot of the frequency of words in the three texts of
interest, AWLeng, AWLesp and TLGeng. The usual ($\zeta = 1$) exponent is
indicated. A Zipf-Mandelbrot law fit for 2 < R < 1000 is not shown but is
discussed in the main text; see also Table 3

Fig. 2. Zipf (log-log) plot of the frequency of words in (a) chapter 1, (b)
chapter 2, for the texts of interest, i.e. AWLeng and AWLesp. The "usual"
$\zeta = 1$ exponent is indicated. A Zipf-Mandelbrot law fit for R < 200 is not
shown but is discussed in the main text; see also Table 3

Let us now search for correlations in the texts through both ways of
constructing a time series from such documents of e.g. M words:

(1) Count the frequency f of appearance of each word in the document.
Rewrite the text such that at each "appearance" of a word, the word is
replaced by its frequency, so that one obtains a time series f(t). Such
a time series is called a "frequency time series" (FTS).

(2) Count the number l of letters of each word, the words being located in the
text successively at the time t = 1 for the first word, at time t = 2 for the
second, etc. Construct a time series l(t). Henceforth, such a time series is
called a length time series (LTS).
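The two mappings can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the tokenizer (lower-casing, words made of letters with optional internal apostrophes) and the function names `tokenize`, `frequency_time_series` and `length_time_series` are assumptions introduced here.

```python
import re
from collections import Counter

# Assumed tokenizer: a word is a run of letters, possibly with inner apostrophes.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    """Return the lower-cased words of the text, in reading order."""
    return WORD_PATTERN.findall(text.lower())

def frequency_time_series(words):
    """FTS: each word is replaced by its number of occurrences in the whole text."""
    counts = Counter(words)
    return [counts[word] for word in words]

def length_time_series(words):
    """LTS: each word is replaced by its number of letters."""
    return [sum(ch.isalpha() for ch in word) for word in words]

sample = "The cat sat on the mat. The end!"
words = tokenize(sample)
fts = frequency_time_series(words)
lts = length_time_series(words)
```

The position of a word in the list plays the role of the time t. The regular expression relies on Unicode-aware matching, so that accented Esperanto letters are treated as letters.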

When applied e.g. to economic (financial) signals [50,51,52,53], each frequency
f and word length l are analogous to the price of a share or the volume
of a transaction. A (scaling or) power law is then often observed, i.e. when
correlated sequences exist, leading to $\zeta_{m,k}$ values quite different
from 1.

These two sorts of time series are thereby each analyzed along one of the two
mentioned techniques, one being more pertinent than the other as outlined
here above. Let us discuss them briefly.

2.1 Zipf method

A large set of references on Zipf's law(s) in natural languages can be found in
[54]. The idea has been applied to many various complex signals or "texts",
i.e. signals translated through a number k of characters characterizing an
alphabet, like, among many others, time intervals between earthquakes [55],
DNA sequences [56], or financial data [50,51,52] along the lines of
econophysics.

Fig. 3. Zipf (log-log) plot of the FTS of sentence lengths, as separated by (a)
dots, (b) commas, in the three texts of interest AWLeng, AWLesp and TLGeng. The
0.33 exponent of the corresponding Zipf law is indicated as a guide to the eye.
A Zipf-Mandelbrot law fit for R < 200 is not shown but is discussed in the main
text; see also Table 3

Fig. 4. Zipf (log-log) plot of the FTS of sentence lengths, as separated by (a)
semicolons, (b) colons, in the three texts of interest AWLeng, AWLesp and
TLGeng. The 0.50 exponent of the corresponding Zipf law is indicated as a guide
to the eye. A Zipf-Mandelbrot law fit for various R ranges is not shown but is
discussed in the main text; see also Table 3

Fig. 5. Zipf (log-log) plot of the FTS of sentence lengths, as separated by (a)
exclamation points, (b) question marks, in the three texts of interest AWLeng,
AWLesp and TLGeng. The 0.50 exponent of the corresponding Zipf law is indicated
as a guide to the eye. A Zipf-Mandelbrot law fit for various R ranges is not
shown but is discussed in the main text; see also Table 3

The original Zipf (FTS) method (though see [13]) [11,57,58] examines the
probability distribution of words in spoken (more exactly written) languages.
Zipf calculated the number N of occurrences of each word in a given text. By
sorting out the words according to their frequency f, i.e. N measured with
respect to the total number M of words in the text, a rank R can be assigned
to each word, with R = 1 for the most frequent one. For any natural language,
one observes a power law for the rank distribution,

$$ f \sim R^{-\zeta} \qquad (1) $$

with an exponent $\zeta$ close to unity. The occurrence of this power law has
already been suggested [60] to be due to the "hierarchical structure" of the
text as well as to the presence of long range correlations (sentences, and
logical structures therein). This strong quantitative statement with ubiquitous
applicability is attested over a vast repertoire of human languages [16]. Yet
there is empirical evidence that Zipf's law in this (FTS) form can at most
account for the statistical behaviour of word frequencies in a zone spanning
the middle-low to low range of the rank variable. Even in the case of long
single texts, Zipf's law renders an acceptable fit only in the small window
between ranks ca. 10 and 1000, which does not represent a significant fraction
of any literary vocabulary. However, power laws lead to valuable insights into
statistical processes, since they imply no characteristic scale, whence some
hierarchical structure. The $\zeta$ exponent, or more generally the exponent of
such a power law, can be given a fractal dimension (or Hurst exponent)
interpretation as in [61].
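The ranking procedure and the estimate of the exponent of Eq. (1) can be sketched as follows. This is only an illustrative sketch: the helper names `rank_frequency` and `zipf_exponent`, the use of an ordinary least-squares line on the log-log plot, and the synthetic data are assumptions made here, not a description of the fitting procedure actually used for the figures.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Word counts sorted in decreasing order; position 0 corresponds to rank R = 1."""
    return sorted(Counter(words).values(), reverse=True)

def zipf_exponent(freqs, r_min=1, r_max=None):
    """Least-squares estimate of zeta in f ~ R**(-zeta) for r_min <= R <= r_max."""
    r_max = len(freqs) if r_max is None else min(r_max, len(freqs))
    ranks = range(r_min, r_max + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is negative; zeta is reported as positive

# Synthetic check on an exact Zipf law f = 1000 / R (hypothetical data).
synthetic = [1000.0 / r for r in range(1, 201)]
zeta = zipf_exponent(synthetic, r_min=10)
toy_ranking = rank_frequency("a b a c a b".split())
```

Restricting the fit to a window of ranks (here through `r_min` and `r_max`) mirrors the remark that the power law only holds over part of the rank range.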

One difficulty arises at the lower and upper ranks of such plots because of the
abundance and rarity of words [62]. Mandelbrot [63,64,65], using arguments
based on fractal ideas applied to the structure of lexical trees, improved the
original form of the law, writing

$$ f(R) = \frac{A}{(1 + C\,R)^{\zeta^{*}}} \qquad (2) $$

in terms of two parameters A and C that need to be adjusted to the data.

The latter form is thought to be more adequately valid for many sorts of data
in the region corresponding to the lowest ranks, that is R < 100, dominated
mostly by (small) function words. In the same spirit one can show that
$w(h) \sim 1/h^{\gamma}$ [66]; whence $\gamma = 1 + \nu$, with
$\nu = 1/\zeta$ or $\zeta^{*}$ [67,68]. We do not discuss further the validity
of Zipf law(s), for which there is an abundant literature.
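A fit of a Zipf-Mandelbrot form to a ranked list can be sketched as below. The sketch assumes the form f(R) = A / (1 + C R)^(zeta*), with A, C and zeta* as free parameters; the fitting strategy (a repeatedly refined grid over C, with a straight-line fit of ln f against ln(1 + C R) at each trial value) and all function names are choices made here for illustration only, and the synthetic data merely use parameter values of the order of those quoted in Table 3.

```python
import math

def _loglinear_fit(freqs, ranks, c):
    """For a fixed C, fit ln f = ln A - zeta* ln(1 + C R) by least squares."""
    xs = [math.log(1.0 + c * r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return math.exp(intercept), -slope, sse

def fit_zipf_mandelbrot(freqs, ranks, c_lo=1e-3, c_hi=1e2, grid=41, rounds=8):
    """Return (A, C, zeta*) minimizing the squared error on the log-log plot."""
    lo, hi = math.log(c_lo), math.log(c_hi)
    best_u = lo
    for _ in range(rounds):
        step = (hi - lo) / (grid - 1)
        trials = [lo + i * step for i in range(grid)]
        scored = [(_loglinear_fit(freqs, ranks, math.exp(u))[2], u) for u in trials]
        _, best_u = min(scored)
        lo, hi = best_u - step, best_u + step  # zoom around the best trial value
    c = math.exp(best_u)
    a, zeta_star, _ = _loglinear_fit(freqs, ranks, c)
    return a, c, zeta_star

# Synthetic ranked data generated from the assumed form (hypothetical values).
ranks = list(range(2, 1001))
freqs = [1177.0 / (1.0 + 0.17 * r) ** 1.15 for r in ranks]
a_fit, c_fit, zeta_fit = fit_zipf_mandelbrot(freqs, ranks)
```

Changing the list `ranks` changes the fit window, which is a simple way to probe how sensitive the three parameters are to the range of R retained.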

It has been shown that this Zipf-Mandelbrot law is also obeyed by so many
random processes [69,70] that any interestingly special character for
linguistic studies has sometimes been ruled out. Nevertheless, it has been
argued that it is possible to discriminate between human writings [71] and
stochastic versions of texts precisely by looking at statistical properties of
words that fall where Eq. (1) does not hold [20]. Whence some questions still
cannot be avoided on artificial languages, translations, and on effects
resulting from automatic or machine translations [43].

AWLeng      f        AWLesp               f
the      1527        la (=the)         2070
and       802        kaj (=and)         628
to        725        ŝi (=she)          508
a         615        ne (=no/not)       426
I         545        mi (=I)            403
it        527        Alicio (=Alice)    347
she       509        diris (=said)      332
of        500        al (=to)           313
said      456        vi (=you)          302
Alice     395        ke (=that)         292

Table 2

Top ten most frequent words in AWLeng and AWLesp with their frequency

Fig. 6. Grassberger-Procaccia (log-log) plot of the correlation integral
$C_n(l)$ as a function of the correlation length l in different phase space
dimensions n, see text, in the two texts of interest (a) AWLeng and (b) AWLesp

Flipping the horizontal and vertical axes of the log-log plot suggested by Eq.
(1), the cumulative probability distribution function (cPDF) P(> f) of the
quantities of interest obeys

$$ P(>f) \sim f^{\,1-\eta} \qquad (3) $$

where $1 - \eta$ is a characteristic power law exponent for the cPDF. Whence
$\eta - 1 = 1/\zeta$, i.e. $p(f) \sim f^{-\eta}$.
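The step from the rank law to the cPDF exponent can be spelled out in a few lines. The sketch below assumes only that the rank R of a word of frequency f counts the words whose frequency is at least f, so that R(f) is proportional to P(> f); the cPDF exponent is written $1 - \eta$ and the probability density is denoted p(f).

```latex
\begin{align*}
f \sim R^{-\zeta} \;&\Longrightarrow\; R(f) \sim f^{-1/\zeta}, \\
P(>f) \propto R(f) \;&\Longrightarrow\; P(>f) \sim f^{-1/\zeta} \equiv f^{\,1-\eta},
\qquad \eta = 1 + \frac{1}{\zeta}, \\
p(f) = -\frac{dP(>f)}{df} \;&\Longrightarrow\; p(f) \sim f^{-\eta}.
\end{align*}
```

For the usual value $\zeta = 1$ this gives $\eta = 2$, while a smaller exponent such as $\zeta = 0.5$ gives $\eta = 3$.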

Note that in the following the length of sentences is also examined from the 
point of view of the numbers of characters between (six sorts of) punctua- 
tion marks, see Table 1. One also distinguishes between the first and second 
chapter, thereby allowing for some consistency test. 
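The measurement of "sentence" lengths between punctuation marks can be illustrated as follows. This is a hedged sketch: the function names are hypothetical, and the conventions adopted here (surrounding blanks stripped, inner blanks and other punctuation counted as characters, the piece following the last separator kept) are assumptions, since the exact counting rules are not spelled out in the text.

```python
def segment_lengths(text, separator):
    """Character counts of the non-empty pieces of text delimited by `separator`."""
    pieces = (piece.strip() for piece in text.split(separator))
    return [len(piece) for piece in pieces if piece]

def ranked_lengths(text, separator):
    """Segment lengths sorted in decreasing order; position 0 is rank R = 1."""
    return sorted(segment_lengths(text, separator), reverse=True)

sample = "Down, down, down. Would the fall never come to an end! I wonder."
by_dot = ranked_lengths(sample, ".")
by_comma = ranked_lengths(sample, ",")
```

Calling `ranked_lengths` with each of the six separators in turn gives one ranked list per punctuation mark, i.e. the kind of data to which a rank law can then be fitted.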
2.2 Grassberger-Procaccia method



In order to get an insight into the dynamics of a system solely from the
knowledge of the time series, a method derived by Grassberger and Procaccia
[48,49] has proved to be particularly useful. This method has been applied to
analyze the dynamics of neural network activity [72], electric activity of
semiconducting circuits [75,71], climate [7S], etc.

We aim at finding some answer to questions like

(1) Can the salient features of the system be viewed as the manifestation
of a deterministic dynamics, or do they contain an irreducible stochastic
element?

(2) Is it possible to identify an attractor in the system phase space from a
given time series [76]?

(3) If the attractor exists, what is its dimensionality r [77]?

(4) What is the (minimal) dimensionality n of the phase space within which
the above attractor is embedded [78]?



This defines the minimum number of variables that must be considered in the 
description (through some model) of the system. 

This is done as follows. Let the LTS time series have M data points, i.e. $y_i$
(i = 1, ..., M). Consider the data as illustrating some dynamical process in a
(phase) space with dimension n. Construct a set of V vectors $v_k$
(k = 1, ..., V), each having n components, as follows:

$$ v_k = (y_k,\ y_{k+\tau},\ y_{k+2\tau},\ \ldots,\ y_{k+(n-1)\tau}) \qquad (4) $$

where $\tau$ is an integer, called the delay time. Notice: V + (n - 1) = M. In
other words, one considers $k + n\tau$ as a sum modulo M. Next one estimates
the correlation integral from the distance $|v_i - v_j|$ between all the
vectors such that $1 \le i, j \le V$. The correlation integral $C_n(l)$ is
obtained from

$$ C_n(l) = \frac{\#\ \text{of pairs}\ (i,j)\ \text{such that}\ |v_i - v_j| < l}{N^2} \qquad (5) $$

In other words,

$$ C_n(l) = \frac{\#\ \text{of pairs}\ (i,j>i)\ \text{such that}\ |v_i - v_j| < l}{N(N-1)/2} \qquad (6) $$

where N = V is the number of vectors.

GP have shown that, for small l, one has

$$ C_n(l) \simeq B\, l^{r} \qquad (7) $$

where B is some constant and r is the so called attractor (correlation)
dimension, measuring the number of dynamic variables or number of degrees
of freedom. In order to obtain r for the different n values, a log-log plot is
in order. The choice of $\tau$ is debatable [7S]. In the following we have
chosen $\tau = 500$, as in other related studies [20], for n = 1 to 15.

Practically, it was noticed that the correlation integral calculated for the
$|v_i - v_j|$ distances takes a finite number of values; therefore each
distance l was "measured" up to three decimal digits. Therefore two distances
differing by less than 0.001 are not differentiated. Even though we have not
tested the robustness of this "numerical approximation", we do not have the
impression that it is a drastic one.

A fit of the beginning of the $C_n(l)$ evolution through the least mean square
technique on a log-log plot leads to a value of the relevant slopes, thus to r
as defined by Eq. (7).
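The whole chain, from the delay vectors to the slope of the correlation integral on a log-log plot, can be sketched as follows. This is an illustration under explicit assumptions, not the implementation used for the results: the distance is taken to be Euclidean, the vectors are built without wrap-around (no modulo M), the pair count is normalized by the number of distinct pairs, and the function names and the toy signal are hypothetical.

```python
import math

def delay_vectors(series, n, tau):
    """Vectors (y_k, y_{k+tau}, ..., y_{k+(n-1)tau}) built without wrap-around."""
    count = max(len(series) - (n - 1) * tau, 0)
    return [tuple(series[k + j * tau] for j in range(n)) for k in range(count)]

def correlation_integral(vectors, l):
    """Fraction of the pairs (i, j > i) whose Euclidean distance is below l."""
    size = len(vectors)
    close = sum(
        1
        for i in range(size)
        for j in range(i + 1, size)
        if math.dist(vectors[i], vectors[j]) < l
    )
    return close / (size * (size - 1) / 2)

def correlation_dimension(vectors, lengths):
    """Least-squares slope of ln C_n(l) versus ln l over the supplied l values."""
    points = []
    for l in lengths:
        c = correlation_integral(vectors, l)
        if c > 0:  # ln C is undefined when no pair is closer than l
            points.append((math.log(l), math.log(c)))
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Toy signal: a straight ramp, whose points lie on a one-dimensional line.
ramp = list(range(200))
vectors = delay_vectors(ramp, n=1, tau=1)
r_estimate = correlation_dimension(vectors, [10.5, 20.5, 40.5])
```

The double loop over pairs costs of the order of V squared distance evaluations, so for a signal of a few times ten thousand words a faster neighbour search or a subsample of the vectors would be needed in practice. Repeating the estimate for n = 1, 2, ... gives the function r(n).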



3 Results 

3.1 Zipf plots

The result of the FTS analysis for the three main texts is shown in Fig. 1.
Figs. 2(a-b) show the (frequency, rank) relation for chapters 1 and 2,
respectively, of both AWL texts, i.e. Esperanto and English. Each log-log plot
roughly indicates a linear relationship for R > 10, thus a $\zeta$ exponent
close to unity, as often found in usual literature. Some curvature is found for
all texts below R ~ 15, where a so called discontinuity exists, explained in
[15] as due to a transition between colloquial ("common") small words and
"distinctive" words. Some break, or change in slope, is also found at ca.
R = 100; see the discussion in [15]. More interestingly, let it be observed
that the frequency at rank R = 1 is much higher for the Esperanto text than for
the English texts. Moreover the variety of distinct words is larger in
Esperanto as well. In between, the words are in general less frequent,
indicating a greater simplicity in vocabulary. The same is true whatever the
chapter considered.

The top ten most frequent words in AWLeng and AWLesp are given with
their frequency in Table 2. It seems of interest to point out differences in
style appearing from such a table. Notice that a translation conserves neither
the number of words in a text nor their relative frequency. Of course
the ranking might be intrinsically different, but the translator can also
modify some sentences according to the language and grammar. An interesting
illustration is noticed in Table 2: the same words 'the' = 'la' and
'and' = 'kaj' are both times the most frequent, but for example 'Alice' occurs
more frequently in the English text than in the Esperanto text, and the same
holds for 'I' = 'mi', even though 'she' = 'ŝi' occurs practically an equal
number of times.

Fig.  Text        A      C     ζ*    Fit range in R
1     AWLeng    1177   0.17   1.15   2 ... 1000
      AWLesp     962   0.28   1.01   2 ... 1000
      TLGeng    1098   0.13   1.21   2 ... 1000
2a    AWLeng     116   0.19   1.16   1 ... 200
      AWLesp      48   0.15   0.01   4 ... 200
2b    AWLeng     118   0.24   1.07   1 ... 200
      AWLesp     168   0.90   0.92   2 ... 200
3a    AWLeng    1062   0.08   0.55   4 ... 200
      AWLesp     984   0.5    0.36   1 ... 200
      TLGeng    1029   0.46   0.34   1 ... 200
3b    AWLeng     366   0.09   0.33   1 ... 200
      AWLesp    1019   1.2    0.34   4 ... 200
      TLGeng     382   0.11   0.27   1 ... 200
4a    AWLeng    4650   0.06   1.15   2 ... 100
      AWLesp    6978   0.14   0.97   1 ... 100
      TLGeng   13128   0.02   4.73   1 ... 60
4b    AWLeng    5068   0.34   0.59   1 ... 100
      AWLesp    5645   0.3    0.6    1 ... 100
      TLGeng    3296   0.05   0.94   1 ... 100
5a    AWLeng    2756   0.03   1.48   4 ... 200
      AWLesp    3630   0.02   1.91   3 ... 200
      TLGeng    4777   0.39   0.61   1 ... 200
5b    AWLeng    3283   0.01   3.61   4 ... 100
      AWLesp    4099   0.02   1.89   2 ... 100
      TLGeng    5504   0.13   0.84   1 ... 100

Table 3

Values of the parameters for the Z-M fit, Eq. (2), indicated together with the
fit range, for the corresponding figures and texts

As indicated in the main text, one can look at the length of sentences in the
case of the three main texts, Figs. 3-5. The relevant separators are mentioned
in the figure captions. A marked difference occurs between the cases "." (dot)
and "," (comma) on one hand and the others, i.e. colon, semicolon, exclamation
point and question mark, ":", ";", "!", "?", on the other hand. In the first
group, the slope is rather close to 1/3, but is closer to 1/2 for the latter
four cases. The number of punctuation marks is roughly the same (see Table 1)
in all texts, but the numbers of dots and commas are much larger than those of
the other punctuation marks. Therefore one might expect some finite size
effect. Whence it is of interest to test Eq. (2) and compare the parameter
values, given in Table 3.


From the figures one can notice that three exponents seem to characterize the
rank law: 1.0, 1/3, 1/2. The value unity is usual. To find low values like 0.5
and 0.33 is more rare. Let us refer to a case in which it was striking to find
values ca. 0.55 and 0.72, i.e. the case of city sizes, in 1600 and 1990
respectively, and 0.74 for the sizes of firms with more than 10 employees in
1997, in Denmark [80], in contrast to the usual 1.0 value [81]. A value smaller
than unity indicates a more homogeneous repartition of the variables (words,
here). One can see some analogy between city and firm sizes from the point of
view of flow in (and out) of citizens or assets. Whence a Gabaix [81] or Simon
[82] model can be thought of in order to understand the values found here. E.g.
Gabaix claims that two causes can lead to a value less than 1.0, i.e. either:

(1) the mean or variance of the growth process deviates from Gibrat's law
[53], according to which the growth rate is independent of the size, or

(2) the variance of the growth process is size-dependent.

Recall that one does not examine the "growth" of the text at this stage yet,
nor has any model for doing so presently, except that of Simon [82] (words
not yet used are added at a constant rate, while words already used are
inserted at a frequency depending on the previous number of occurrences; this
leads to Zipf law; thus the rate of appearance of new words in fact decreases
as the text length increases). One can nevertheless agree that the sample size
is relevant for finding a small $\zeta$ value. Indeed it is clear that the
values found correspond to the lengths of sentences which are defined through
various punctuation marks, counting characters rather than words. Several
orders of magnitude in the maximum rank distinguish the cases. What is still
surprising is that the longest sentences, thus defined through dots and commas,
lead to smaller $\zeta$ values than those for the other punctuation marks,
which lead to less frequent sentences.

An alternate view can be taken through the Z-M analytical form, Eq. (2).
Values of the parameters are given in Table 3 for various ranges of R. It is
fair to state that the parameters are NOT very robust with respect to the
range. Values of $\zeta^{*}$ can be found close to the apparently best looking
slope, $\zeta$, but other values can be found as well. This is due to the
strong influence of the low rank points. The situation is paradoxical when one
remembers that the analytical form is supposed to be used in order to take into
account the finite value at R = 1. However the curvature for the (small)
function words markedly influences the outcome. In order to illustrate the
point, a brief example is given in an Appendix.



3.2 Grassberger-Procaccia plots

The analysis of correlation integrals allows one to estimate whether the number
of degrees of freedom (of a process) is large or reasonably small. It seems
that the usual goal is rather qualitative. However it pertains to the
fundamental question on noisy signals: is it noise or chaos? As explained here
above, the algorithm is based on the statistics of pairwise distances for an
arbitrary choice of the delay time. Therefore the output of the method results
in observing an evolution of correlations, i.e. in the knowledge of how often a
point in some (= "the") phase space is found near another, whence illustrating
some dynamical features connecting local and global features.

Text      Slope of r(n)   Standard deviation
AWLeng    0.84            0.01
AWLesp    0.747           0.004
TLGeng    0.77            0.01
Global    0.79            0.01

Table 4

Measured slopes of the linear function r(n)

Fig. 7. Grassberger-Procaccia (log-log) plot of the correlation integral
$C_n(l)$ as a function of the correlation length l in phase spaces with
different dimensions (n) for TLGeng

The three sets of correlation integrals, calculated following the method
described here above, are shown in Figs. 6-7. The slopes can be summarized
through a graph relating r and n (Fig. 8). It is found that the attractor
dimension r is not only smaller than the space dimension, as it should be [7S],
but also is a linear function, on a log-log plot, of the so called phase space
dimension n, for the three texts of interest. A remarkable power law is found,
whatever the text, $r = n^{\lambda}$, with $\lambda = 0.79$, which does not
indicate any saturation. It seems of great interest to examine other authors
and to find whether $\lambda$ characterizes some style or author or ....

Fig. 8. Attractor dimension r as a function of the so called phase space
dimension n for the three texts of interest. Notice that a linear relationship
is found on a log-log plot, with a proportionality coefficient
$\lambda = 0.79$, as for $r = n^{\lambda}$

4 Conclusion 



At first sight, a time series of a single variable appears to provide a limited
amount of information. We usually think that such a series is restricted to a
one-dimensional view of a system which, in reality, contains a large number
of independent variables. On one hand, FTS and LTS result from a dynamical
process, which is usually first characterized by its fractal dimension. The
first approach should contain a mere statistical analysis of the output, as
done through a Zipf-like analysis. It is found that analytical forms, like
power laws with different characteristic exponents for the ranking properties,
exist. The Zipf exponent can take values ca. 1.0, 0.50 or 0.30, depending on
how a sentence is defined. This non-universality is conjectured to be a measure
of the author's style. Another approach, through a Zipf-Mandelbrot law, seems
unreliable, most likely due to the (present) lack of distinction between
function and determining words, and to breaks occurring in the f(R) plots.
Something which has not been examined and is left for further studies is the
distinction between oral-like and descriptive parts of a text and its
translation.

Moreover, a time series is known [18,49,75] to bear the marks of all other 
variables participating in the dynamics of the system. Thus one is able to 
reconstruct the system's phase space from such a series of one-dimensional 
observations. When applying the Grassberger-Procaccia (GP) method to a 
physics time series, one wants to know whether the attractor is based on a finite 
set of variables. The lack of saturation found here through the law 
$r = n^{\lambda}$ for the size of the attractor indicates that the writing of a 
text by some creative author can hardly be reduced to a finite set of 
differential equations! Yet the analytical form suggests examining whether 
$\lambda$ characterizes an author style or creativity, and how robust its value 
can be. 

Finally, as in [20] we concur that the application of GP analysis indicates 
that linguistic signals may be considered as the manifestation of a complex 
system of high dimensionality, different from random signals or systems of low 
dimensionality such as the Earth climate or financial signals. 

Last but not least, on comparing AWLeng, AWLesp, and TLGeng with 
both the "static" and "dynamic" methods, it seems that the texts are 
qualitatively similar, which indicates ... the quality of the translator. In this spirit, 
it would be interesting to compare with results originating from text obtained 
through a machine translation, as recently studied in \8^. It is of huge interest 
to see whether a machine is more flexible with vocabulary and grammar than 
a human translator; see also [43]! 

5 Appendix 

The Zipf-Mandelbrot law, Eq. (2), is thought to be useful for better describing 
the ranking function f(R), in particular in order to take into account the 
finite value of f at R ~ 1. Yet from the data presented in Table 3, it can be 
observed that the parameters, in particular $\zeta$, are far from robust when the 
range of R slightly varies. For example, $\zeta$ can vary from 0.84 to 3.61 when 
only the fit range is slightly changed, as in the case of Fig. 5b, for sentences 
limited by question marks in the three original texts, where one expects an 
exponent near 0.5. It appears that if one fits from R = 1 one finds $\zeta$ = 0.65, 
but from R = 2, $\zeta$ = 1.68, from R = 3, $\zeta$ = 2.68, and from R = 4, 
$\zeta$ = 3.61, as shown in Table 3. This is "obviously" due to the curvature of 
the data at low R. Another example pointing to the probable origin of the 
instability of the fit parameter values, for AWLeng with sentences defined 
through dots in Fig. 1, is given in Table 4, where a few results of small changes 
in ranges, removing the first one, two, three or four points, and the 
corresponding parameter fits are given with the (absolute) error bars. 

AWLeng 

Range of R        parameter    value        absolute error 
from 2 to 200     A            2104.5665    102.1277 
                  C            1.2151       0.2201 
                  $\zeta$      0.3924       6.1668e-3 
from 2 to 200     A            1239.7700    18.5571 
                  C            0.1553       1.1255e-2 
                  $\zeta$      0.4874       7.7509e-3 
from 3 to 200     A            1105.3531    10.7540 
                  C            9.4509e-2    4.7342e-3 
                  $\zeta$      0.5334       6.9828e-3 
from 4 to 200     A            1061.8008    9.7680 
                  C            7.9410e-2    3.7663e-3 
                  $\zeta$      0.5526       7.1248e-3 

Table 5 

Effect of low ranking points on Z-M fit; parameter values and their corresponding 
error bars for AWLeng sentences limited by "dots" 
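
The sensitivity to the lower bound of the fit range can be explored with a few lines of Python. The sketch below assumes that Eq. (2) has the usual Zipf-Mandelbrot form f(R) = A/(C+R)^$\zeta$, and it swaps in a simple technique of its own: a grid scan over C combined with a linear regression in log-log coordinates, chosen only to keep the example free of external libraries. It is not the fitting procedure used for the tables, the function names are hypothetical, and the data are synthetic, with the first point lowered by hand to mimic curvature at low R.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares; returns slope, intercept, squared error."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_zipf_mandelbrot(ranks, freqs, c_grid):
    """Scan C; for each C, log f = log A - zeta * log(C + R) is linear."""
    ys = [math.log(f) for f in freqs]
    best = None
    for c in c_grid:
        xs = [math.log(c + r) for r in ranks]
        slope, intercept, sse = linear_fit(xs, ys)
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), c, -slope)
    _, a, c, zeta = best
    return a, c, zeta


def fits_by_lower_bound(ranks, freqs, lower_bounds, c_grid):
    """Repeat the fit, dropping all points with rank below each bound."""
    results = {}
    for lo in lower_bounds:
        kept = [(r, f) for r, f in zip(ranks, freqs) if r >= lo]
        rs, fs = zip(*kept)
        results[lo] = fit_zipf_mandelbrot(rs, fs, c_grid)
    return results


# Synthetic data (NOT from the texts): exact law, then a lowered first point.
ranks = list(range(1, 201))
freqs = [1000.0 / (1.5 + r) ** 0.5 for r in ranks]
freqs[0] *= 0.6
c_grid = [i / 100 for i in range(501)]  # C scanned from 0 to 5
for lo, (a, c, zeta) in fits_by_lower_bound(ranks, freqs, [1, 2, 3, 4], c_grid).items():
    print(lo, round(a, 2), round(c, 2), round(zeta, 4))
```

Because the regression is done on logarithms, the points are weighted differently than in a least-squares fit on f itself, so the numbers are only indicative. The purpose is to see how a single deviating low-rank point moves A, C and $\zeta$ together once it enters the fit range.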

I have not found much discussion of the matter in the literature, maybe because 
the case is either not frequent or not examined. See nevertheless [85], 
where it is suggested that $\zeta$ be interval dependent and increase 
logarithmically with R. In the present case, it appears that one can consider the 
origin of the instability to reside in the "large" variations of f(R) at small R. 
In fact the curvature of f(R) changes from convex to concave at small R. This leads 
to an instability in the set of least mean square fits. This, in other words, is 
due to the number of regimes, i.e. changes in curvature, in the data. Powers [15] 
(and later others like [59]) had already noticed that one should distinguish 
between small (function) words and large (determining) words, and pointed 
to the break, or change in slope, at finite R (~100). A recommendation is in 
order: a visual scan of the data should be made before attempting a fit with 
Eq. (2), in order to observe the number of regimes, or the number of crossover 
points, which might appear in the data. It is also of course useless to attempt 
a fit with many more parameters; one would need at least three per regime! 
Yet the understanding of the position of the crossover points might be of 
interest. Recall the remarkable papers on the position of the crossover points 
in detrended fluctuation analysis studies [86,87], related to a periodic 
background or trend in time series. Such considerations would illuminate, in the 
present context, the language quality level or an author style and creativity 
through a text "background" content. 